Introduction to data science in R
Introduction to data visualization
Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute
# Load tidyverse library:
library(tidyverse)
The function paste0 pastes string values together with no separator between them. For example, we can paste the values 'hello' and 'World' together as follows. The resulting object is a nice-looking camel case value.
paste0('hello', 'World')
## [1] "helloWorld"
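As a small aside, the sketch below contrasts paste0 with its sibling paste, which inserts a separator (a space by default). It uses base R only; the file names are made up for illustration and are not part of the course data.

```r
# paste0() joins strings with no separator.
noSeparator <- paste0('hello', 'World')

# paste() inserts a space unless sep says otherwise.
withSpace <- paste('hello', 'World')
withUnderscore <- paste('hello', 'World', sep = '_')

# Both functions are vectorized, so one call can build several strings.
# These file names are hypothetical.
fileNames <- paste0('file', 1:3, '.csv')

print(noSeparator)
print(withSpace)
print(withUnderscore)
print(fileNames)
```

Because no separator is added, paste0 is a convenient choice for assembling URLs from pieces.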
We will use the function paste0 to create an easy-to-read URL:
gitUrl <-
'https://raw.githubusercontent.com/bsevansunc/'
courseData <-
'smsc_data_science/master/data/'
paste0(
gitUrl,
courseData,
'birdMeasures.csv')
## [1] "https://raw.githubusercontent.com/bsevansunc/smsc_data_science/master/data/birdMeasures.csv"
We can use this URL to read in the data.
birdMeasures <-
read.csv(
paste0(
gitUrl,
courseData,
'birdMeasures.csv'))
Take a moment to explore the data frame birdMeasures.
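One possible way to explore a data frame is sketched below. To keep the example runnable without a network connection, it uses a tiny stand-in called birdSample (a hypothetical name) holding three columns of the kind found in birdMeasures; the same functions can be applied to birdMeasures itself.

```r
# Hypothetical stand-in for birdMeasures, small enough to type by hand.
birdSample <- data.frame(
  spp = c('NOCA', 'CACH', 'AMRO'),
  mass = c(36.7, 9.7, 80.1),
  wing = c(92, 60, 130),
  stringsAsFactors = FALSE)

# Number of rows and columns:
dim(birdSample)

# Column names:
names(birdSample)

# Compact summary of each column's class and first values:
str(birdSample)

# First two rows:
head(birdSample, 2)

# Summary statistics for one numeric column:
summary(birdSample$mass)
```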
Let’s read in birdMeasures as a tibble, because there are a lot of data in that file!
birdMeasures <-
dplyr::as_tibble(
read.csv(
paste0(
gitUrl,
courseData,
'birdMeasures.csv')))
birdMeasures
## # A tibble: 5,234 x 11
## id region spp bandNumber enc date mass wing tl age sex
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 g435… Atlan… NOCA 2641-63316 B 2014… 36.7 92 100 AHY M
## 2 c703… Atlan… NOCA 2641-63362 B 2014… 40.4 93 98 SY M
## 3 b264… Atlan… CACH 2710-53995 B 2015… 9.7 60 50 AHY M
## 4 y107… Atlan… AMRO 1352-27606 B 2015… 80.1 130 97 AHY F
## 5 w113… Atlan… AMRO 1352-27609 B 2015… 73.8 130 96 AHY M
## 6 f364… Atlan… NOCA 2641-63899 B 2015… 42.1 86 100 AHY F
## 7 m960… Atlan… NOCA 2641-63900 B 2015… 42.7 92 102 AHY F
## 8 e424… Atlan… AMRO 1352-27610 B 2015… 72.7 130 97 AHY F
## 9 k126… Atlan… AMRO 1352-27614 B 2015… 75 120 87 AHY F
## 10 j492… Atlan… GRCA 2657-47401 B 2015… 38.3 87 90 AHY M
## # ... with 5,224 more rows
I don’t like that there are a lot of factors in the data frame. How might we overcome this?
birdMeasures <-
dplyr::as_tibble(
read.csv(
paste0(
gitUrl,
courseData,
'birdMeasures.csv'),
stringsAsFactors = FALSE))
ggplot(data = birdMeasures)
Aesthetics describe mapping the value of some variable to an observable feature.
ggplot(
data = birdMeasures,
mapping = aes(x = spp))
A geometry plot element provides a visible representation of observations. Geometries are called using functions named geom_[geometry]. Geometries used in this lesson include geom_bar, geom_histogram, and geom_density.
ggplot(
data = birdMeasures,
mapping = aes(x = spp)) +
geom_bar()
The argument names data = and mapping = were not really necessary, because ggplot matches these two arguments by position. Let’s simplify:
ggplot(
birdMeasures,
aes(x = spp)) +
geom_bar()
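To see what geom_bar draws, the sketch below pulls the computed bar heights out of a plot with the ggplot2 function layer_data and compares them with a base R tally. The data frame birdSample is a hypothetical stand-in for birdMeasures, so the example does not depend on the course data.

```r
library(ggplot2)

# Hypothetical stand-in for birdMeasures: one row per captured bird.
birdSample <- data.frame(
  spp = c('AMRO', 'AMRO', 'CACH', 'NOCA', 'NOCA', 'NOCA'),
  stringsAsFactors = FALSE)

barPlot <-
  ggplot(
    birdSample,
    aes(x = spp)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for a layer.
# For geom_bar(), the count column holds the height of each bar.
barCounts <- layer_data(barPlot)$count

# The same tally in base R:
tableCounts <- as.vector(table(birdSample$spp))

print(barCounts)
print(tableCounts)
```

The bar heights are counts of rows per species, which is why only an x aesthetic is needed.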
The function geom_density can be used to display the density distribution of a vector. Using the aesthetic x = mass, display the distribution of Black-capped chickadee (BCCH) and Carolina chickadee (CACH) mass measurements:
# Subset birdMeasures to BCCH and CACH and plot density:
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_density()
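The bracket subsetting used above can also be written with dplyr::filter, which some readers find easier to scan. The sketch below is an alternative under stated assumptions, not the approach used in these slides; birdSample and its mass values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for birdMeasures.
birdSample <- dplyr::as_tibble(
  data.frame(
    spp = c('BCCH', 'CACH', 'NOCA', 'CACH'),
    mass = c(11.2, 9.7, 36.7, 10.1),
    stringsAsFactors = FALSE))

# Base R: a logical index inside square brackets.
bracketSubset <-
  birdSample[birdSample$spp %in% c('BCCH', 'CACH'), ]

# dplyr: the same condition passed to filter().
filterSubset <-
  dplyr::filter(birdSample, spp %in% c('BCCH', 'CACH'))

print(bracketSubset)
print(filterSubset)
```

Either subset can be passed as the first argument of ggplot.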
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram()
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(binwidth = 1)
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(bins = 20)
ggplot(
birdMeasures[birdMeasures$spp != 'NOCA',],
aes(x = spp)) +
geom_bar(fill = 'gray')
ggplot(
birdMeasures[birdMeasures$spp != 'NOCA',],
aes(x = spp)) +
geom_bar(fill = 'gray',
color = 'black')
ggplot(
birdMeasures[birdMeasures$spp != 'NOCA',],
aes(x = spp)) +
geom_bar(fill = 'gray',
color = 'black',
size = 0.7)
Modify your density plot from Exercise One:
Use the fill argument to fill your density shape with the color “gray”.
The argument alpha can be applied to a geometry to adjust its transparency. Set the transparency of the density shape to alpha = 0.7.
# Subset birdMeasures to BCCH and CACH and plot density:
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_density(fill = 'gray', alpha = 0.7)
Aesthetics describe mapping the value of some variable to an observable feature.
ggplot(
birdMeasures[birdMeasures$spp != 'NOCA',],
aes(x = spp)) +
geom_bar(aes(fill = region))
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20)
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black')
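The difference between setting a fill and mapping a fill can be checked directly. In the sketch below, fill = 'gray' outside aes applies one color to every bar, while aes(fill = sex) lets ggplot2 choose one color per value of sex. The data frame birdSample and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the chickadee subset of birdMeasures.
birdSample <- data.frame(
  mass = c(9.5, 9.7, 10.1, 10.4, 11.0, 11.3),
  sex = c('F', 'M', 'F', 'M', 'F', 'M'),
  stringsAsFactors = FALSE)

# Fill set to a constant, outside aes():
fixedFill <-
  ggplot(birdSample, aes(x = mass)) +
  geom_histogram(fill = 'gray', bins = 5)

# Fill mapped to a variable, inside aes():
mappedFill <-
  ggplot(birdSample, aes(x = mass)) +
  geom_histogram(aes(fill = sex), bins = 5)

# Distinct fill colors that each layer will draw:
fixedColors <- unique(layer_data(fixedFill)$fill)
mappedColors <- unique(layer_data(mappedFill)$fill)

print(fixedColors)
print(mappedColors)
```

Note that color = 'black' in the slide code is also a constant, so it sits outside aes alongside bins.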
Modify your density plot from Exercise One. Inside geom_density, map the fill aesthetic to sex so that females and males are assigned different fill colors.
# Subset birdMeasures to BCCH and CACH and plot density:
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_density(
aes(fill = sex),
alpha = 0.7)
Faceting splits a plot into multiple plots by some variable.
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp)
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp, nrow = 2)
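As a quick check on what facet_wrap does, the sketch below counts the panels in a faceted plot. Each row of the data returned by layer_data carries a PANEL identifier, so the number of distinct identifiers equals the number of facets. The data frame birdSample and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the chickadee subset of birdMeasures.
birdSample <- data.frame(
  spp = c('BCCH', 'BCCH', 'BCCH', 'CACH', 'CACH', 'CACH'),
  mass = c(10.9, 11.2, 11.6, 9.5, 9.7, 10.1),
  stringsAsFactors = FALSE)

facetPlot <-
  ggplot(birdSample, aes(x = mass)) +
  geom_histogram(bins = 5) +
  facet_wrap(~spp, nrow = 2)

# One panel is created per unique value of spp.
panelCount <- length(unique(layer_data(facetPlot)$PANEL))

print(panelCount)
```

With nrow = 2 the two panels are stacked vertically, which makes it easier to compare distributions along a shared x axis.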
Modify your density plot from Exercise Three. Use the facet_wrap function with the argument nrow = 2 to generate separate plots of Black-capped and Carolina chickadees.
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_density(
aes(fill = sex),
alpha = 0.7) +
facet_wrap(~spp, nrow = 2)
Labels describe the plot and axis titles.
ggplot(
birdMeasures[birdMeasures$spp != 'NOCA',],
aes(x = spp)) +
geom_bar(
aes(fill = region),
color = 'black',
size = .7) +
labs(
title = 'Birds banded and recaptured 2000-2017',
x = 'Species',
y = 'Count')
Modify the density plot you created in Exercise Four:
The default colors of ggplot are pretty ugly. Luckily, you can modify them in countless ways!
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp, nrow = 2) +
scale_fill_manual(values = c('blue', 'red'))
Color-picker apps can be a great way to find colors that you like on the internet.
Using Team Zissou’s hat and shirt color:
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp, nrow = 2) +
scale_fill_manual(values = c('#9EB8C5', '#F32017'))
You can hunt around to find colors that you like and then save your palette for use later:
zPalette <- c('#9EB8C5', '#F32017')
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp, nrow = 2) +
scale_fill_manual(values = zPalette)
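A saved palette can also be a named vector, in which case scale_fill_manual matches colors to values by name rather than by position. The sketch below assumes the sex column is coded only as 'F' and 'M'; the names namedPalette and birdSample are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for birdMeasures.
birdSample <- data.frame(
  sex = c('F', 'F', 'M'),
  stringsAsFactors = FALSE)

# Names tie each color to a value of sex, so the order of the
# vector no longer matters (M is deliberately listed first here).
namedPalette <- c(M = '#F32017', F = '#9EB8C5')

palettePlot <-
  ggplot(birdSample, aes(x = sex)) +
  geom_bar(aes(fill = sex)) +
  scale_fill_manual(values = namedPalette)

# Inspect the fill assigned to each bar; 'F' is the first x position.
barData <- layer_data(palettePlot)
femaleFill <- unique(barData$fill[barData$x == 1])
maleFill <- unique(barData$fill[barData$x == 2])

print(femaleFill)
print(maleFill)
```

Naming the palette guards against colors swapping if the data are later subset or the factor levels are reordered.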
Modify the density plot you created in Exercise Five. Use scale_fill_manual to set custom fill colors.
We can use the scale_fill_manual function from above to modify the legend by specifying the name and labels arguments:
ggplot(
birdMeasures[birdMeasures$spp %in% c('BCCH', 'CACH'),],
aes(x = mass)) +
geom_histogram(
aes(fill = sex),
bins = 20,
color = 'black') +
facet_wrap(~spp, nrow = 2) +
scale_fill_manual(values = c('#9EB8C5', '#F32017'),
name = 'Sex',
labels = c('Female', 'Male'))
Modify the density plot you created in Exercise Six. Use scale_fill_manual to set the legend title and labels.